ExATOlp: extraction of language resources from Portuguese corpora
Authors
Abstract
This paper presents four main features of the ExATOlp software tool. These features provide the following language resources: relevant corpus terms with their morpho-syntactic and frequency features; a concordancer (term contexts); concept tags; and concept hierarchies. The tool emphasizes the high quality of the extracted terms. The resources it provides offer a concise representation of non-obvious characteristics of the extracted terms.
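As a rough illustration of two of the resources named above, term frequencies and a concordancer, the following minimal sketch counts tokens and lists the contexts of a term in a toy Portuguese text. It is not ExATOlp's implementation: the function names (`tokenize`, `term_frequencies`, `concordance`), the plain word tokenizer, and the sample sentence are all assumptions made for this example, and no morpho-syntactic annotation is used.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercased word tokens.
    # In Python 3, \w also matches accented Portuguese letters.
    return re.findall(r"\w+", text.lower())


def term_frequencies(tokens):
    # Map each token to its number of occurrences.
    return Counter(tokens)


def concordance(tokens, term, window=2):
    # Return (left context, term, right context) for each occurrence of term,
    # with up to `window` tokens on each side.
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Toy corpus, invented for this example.
corpus = "O corpus contém termos. Cada termo do corpus tem um contexto."
tokens = tokenize(corpus)
freqs = term_frequencies(tokens)
contexts = concordance(tokens, "corpus", window=2)
```

Here `freqs["corpus"]` gives the frequency of a term and `contexts` lists each occurrence with its neighbouring words, which is the kind of view a concordancer offers.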
Similar resources
Descoberta Automática de Relações Não-Taxonômicas a Partir de Corpus em Língua Portuguesa
Ontology construction is a complex process composed of extraction tasks for domain concepts, as well as taxonomic and non-taxonomic relations among concepts. The extraction of non-taxonomic relations is the most neglected task, especially for Portuguese texts. Therefore, this paper presents a proposal for extracting non-taxonomic relations from Portuguese texts represented by a list of concepts ...
Extracting semantic relations from Portuguese corpora using lexical-syntactic patterns
The growing investment in automatic extraction procedures, together with the need for extensive resources, makes semi-automatic construction a new viable and efficient strategy for developing language resources, combining accuracy, size, coverage and applicability. These assumptions motivated the work depicted in this paper, aiming at the establishment and use of lexical-syntactic patterns f...
Building a Corpus for Named Entity Recognition using Portuguese Wikipedia and DBpedia
Some natural language processing tasks can be learned from example corpora, but having enough examples for the task at hand can be a bottleneck. In this work we address how Wikipedia and DBpedia, two freely available language resources, can be used to support Named Entity Recognition, a fundamental task in Information Extraction and a necessary step of other tasks such as Co-reference Resoluti...
Speech Recognition for Brazilian Portuguese using the Spoltech and OGI-22 Corpora
Speech processing is a data-driven technology that relies on public corpora and associated resources. In contrast to languages such as English, there are few resources for Brazilian Portuguese (BP). This work describes efforts toward narrowing this gap and presents systems for speech recognition in BP using two public corpora: Spoltech and OGI-22. The following resources are made available: AT...
EχATOLP – An Automatic Tool for Term Extraction from Portuguese Language Corpora
This paper describes EχATOLP, a software tool to extract significant terms from an annotated corpus written in Portuguese about a specific domain of interest. Being based on linguistic and statistical approaches, this tool extracts terms that are frequent and syntactically relevant to the domain of interest.